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Abstract 

Quantitative genetic studies that model complex, multivariate phenotypes 
are important for both evolutionary prediction and accurate livestock selection. 
For example, changes in gene expression can provide insight into developmen- 
tal and physiological mechanisms that link genotype and phenotype. However, 
classical analytical techniques are poorly suited to quantitative genetic studies 
of gene expression where the number of traits assayed per individual can reach 
many thousand. Here, we derive a Bayesian sparse factor model for estimating 
the genetic covariance matrix (G-matrix) of high-dimensional traits, such as 
gene expression. The key idea of our model is that we need only consider G- 
matrices that are biologically plausible. An organism's entire phenotype is the 
result of developmental processes that are modular and have limited complex- 
ity. This implies that the G-matrix will be highly structured. In particular, we 
assume that a limited number of intermediate traits (or factors, e.g., variations 
in development or physiology) control the variation in the high-dimensional 
phenotype, and that each of these intermediate traits is sparse - affecting only 
a few measured traits. The advantages of this approach arc three-fold. First, 
sparse factors are interpretable and provide biological insight into mechanisms 
underlying the genetic architecture. Second, enforcing sparsity helps prevent 
sampling errors from swamping out the true signal in high-dimensional data. 
Third, our Bayesian analysis automatically provides credible intervals for the 
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heritability of measured and intermediate traits. We demonstrate the advan- 
tages of our model on simulated data and re-analyze gene expression data from 
a natural population of Drosophila melanog aster. 

Keywords G-matrix, factor model, sparsity, Bayesian inference, animal 
model 



1 Introduction 

Understanding evolutionary change requires knowledge of phenotypes in a popula- 



tion, as well as the genetic architecture underlying phenotypic variation (HoULE 



2010). It is well known in quantitative genetics that multiple correlated traits need 



to be modeled jointly to avoid unexpected or counterintuitive evolutionary outcomes 



(Walsh and Blows, 2009). While most evolutionary studies have focused on ex- 
ternal traits, such as morphology or coloration, there is an increasing effort to collect 
more comprehensive phenotypic information. Traits such as behavior, development 
or physiology, and molecular signatures such as cellular metabolism are also impor- 
tant for determining fitness and driving evolution. Including a comprehensive set of 
phenotypes is likely to enrich evolutionary studies and lead to more accurate livestock 
and crop selection. However, these traits tend to be tightly associated at several lev- 
els: morphology is the output of development; behavior drives and can be driven 
by physiology; cellular metabolism powers growth. Applying the tools of quanti- 
tative genetics to these high-dimensional and highly correlated datasets introduces 
considerable analytical and computational challenges. In this paper we formulate a 
modeling framework to address these challenges. 

The most basic quantitative genetics analysis partitions total phenotypic varia- 



tion into genetic and environmental components (LYNCH and Walsh, 1998). The 
central quantity of interest in many analyses is the matrix of additive genetic vari- 
ances and covariances among traits, called the G-matrix. The G-matrix encodes 
information about evolutionary potential in a set of traits. This is highlighted by 



the Breeder's or Lande equation (Lande, 1979), which states that when selection is 



applied to a set of traits, the expected response is the G-matrix times the selection 
gradient - the G-matrix rotates and scales the selection gradient and can potentially 
shift the direction of change in each trait. An eigendecomposition of the G-matrix 
provides important insight into this process. Selection gradients aligned with eigen- 
vectors corresponding to large eigenvalues are expected to have a strong response. In 
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particular, the eigenvector with the largest eigenvalue has been called the line of ge- 
netic least resistance and is thought to bias evolutionary change (Schluter, 1996). 
Similarly, if the G-matrix is singular then certain selection directions (in particular, 
selection vectors in the nullspace) will be ineffective even if all the traits selected 
on are variable. Such constraints are reduced when the G-matrix is highly modular. 
Modularity in the G-matrix - groups of traits that are genetically correlated but 
uncorrelated with other traits - captures both pleiotropy as well as traits that evolve 
independently (Cheverud, 1996). Estimating the G-matrix is thus a key step in 
many quantitative genetic analyses. 

There has been increasing interest in including gene expression as a trait in evolu- 



tionary analyses (Ayroles et al, 2009 McGRAW et ai, 2011 Gibson and Weir 



2005). Genome- wide gene expression assays target thousands of traits simultane- 



ously and provide a means to monitor cellular and developmental traits that would 
otherwise be difficult to measure. Databases of gene function and gene expression 
signatures of cellular perturbations, stresses or disease states are continually growing, 
and have been successful in turning gene expression measurements into biologically 
insights. Using expression as a trait is also appealing since variation in expression is 
expected to have a simpler genetic basis than fitness or yield, yet can have a large 
effect on fitness. Genetic analyses of gene expression phenotypes can also identify 
sets of genetically co-variable transcripts and infer novel molecular networks. 

The challenge in scaling methods developed in quantitative genetics to hundreds 
or thousands of traits is primarily methodological. The number of parameters re- 
quired to infer the G-matrix grows as p{p + l)/2, where p is the number of traits. 
The situation is compounded if we are also required to model environmental varia- 
tion or measurement error (KiRKPATRiCK and MEYER, 2004). The huge number of 
parameters coupled with modest numbers of individuals in evolutionary studies can 
lead to instability in parameter estimates. Extracting biological insight from a large 
matrix is also a challenge. Individual genetic variances or covariances can be highly 
misleading if not considered with respect to the overall structure of the G-matrix 



even for traits with strong genetic variances or covariances (Walsh and Blows 



2009; HiNE and Blows 2006). 



Our objective in this paper is to develop a model for estimating G-matrices that 
is scalable to large numbers of traits and is applicable to a wide range of datasets, 
including both experimental crosses and pedigreed populations. Previous methods 
for estimating additive genetic variation and covariation can be categorized as fol- 
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lows: (1) pairwise estimates of genetic covariation followed by clustering (Ayroles 



et al. 2009 Stone and Ayroles 2009), (2) methods based on moments estimators 



( |HlNE and Blows| |2006[ |Mcgraw et al.\ |2011[ ), and (3) methods based on mixed 



effects models (Henderson, 1984; Kruuk, 2004 Kirkpatrick and Meyer, 2004 



DE Los Campos and Gianola 2007). The shortcoming of the pairwise approach 



(Ayroles et al., 2009) is that simply collecting pairwise covariance estimates will 
not in general result in a proper covariance matrix - the inferred G-matrix may 



not be positive (semi) definite. Methods based on moments estimators (HiNE and 



Blows, 2006; Mcgraw et al. 2011) are generally not flexible enough to model 



more involved experimental designs such as wild populations or large breeding pro- 
grams. Estimators based on the "Animal Model" (Henderson, 1984) address the 
above problems by fitting a linear mixed model (LMM) using pedigree information 
to partition the observed phenotypic variance of a trait into various genetic and envi- 
ronmental components. The LMM can be applied to a much broader range of experi- 
mental designs and studies (Kruuk, 2004), and produces estimates in the parameter 
space. However, these methods are computationally costly for high-dimensional data. 
Dimension-reduction approaches such as principle components analysis on the high- 
dimensional trait vector followed by univariate analysis ( Biswas et al. , 2008 ) can 
reduce computation demands, but are problematic if there is significant environmen- 



tally induced covariation among traits (Meyer and Kirkpatrick, 2010 Mcgraw 



et al. . 2011). Efficient mixed model approaches for moderate-dimension data include 



using Restrictred Maximum Likelihood (REML) to fit the eigenvectors corresponding 
to the largest eigenvalues of the G-matrix (Kirkpatrick and Meyer 2004), and a 
Bayesian approach constraining the G-matrix to take the form of a factor model with 
a limited number of latent traits (Ide Los Campos and GianolaI 120071). However, 



neither of these LMM approaches (Kirkpatrick and Meyer 2004 de Los Cam- 



pos and Gianola, 2007) as formulated will scale to high- dimensional trait data. 



The key idea that allows us to infer G-matrices for large numbers of traits is 
that the matrix of additive genetic variation will likely be both sparse and modular. 
Here, sparsity means that many of the values in the G-matrix (or a factorization 
of the matrix) will be zero, and modular means that groups of traits will covary 
together. Our a priori assumption is that the G-matrix is composed of only a few 
modules (factors), and that few (sparse) traits will have significant effects in each 
module. We therefore constrain the class of covariance matrices we can estimate, a 
necessary procedure for inference of covariance matrices given high-dimensional data 
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The biological argument behind this prior assumption starts with the observa- 
tion that the phenotypes that we measure in an organism are formed by a shared 
developmental process and this developmental process has limited complexity. For 
gene expression, regulatory networks and functional pathways control gene expression 
and variation in gene expression can be often linked to genetic variation in pathways 
( [XlONG et al.\ |2012t |de la Cruz et al\ |2010[ ). For a given dataset, we make two 
assumptions about these pathways: (1) a limited number of pathways are relevant 
for trait variation and (2) each pathway affects a limited number of genes. There 
is support and evidence for these modehng assumptions in the quantitative genetics 
literature as G-matrices tend to be highly structured (Walsh and Blows, 2009) 
and the majority of genetic variation is contained in a few dimensions regardless of 
the number of traits studied ( [Ayrqles et al.\ |2009t [McGRAW et al.\ |2011[ ). 

In this paper we provide a Bayesian sparse factor model for inferring G-matrices 
from pedigree information for hundreds or thousands of traits. This model is an 
extension of the classic multivariate animal model, and so is highly flexible. We 
demonstrate the advantages of the model on simulated data and re-analyze gene ex- 



pression data from a published study on Drosophila melanogaster (Ayroles et al. 



2009). Although high-dimensional sparse models have been widely used in genetic 



association studies (Cantor et al, 2010 Engelhardt and Stephens, 2010 Ste- 



gle et a/. 2010; Parts et al, 2011 Zhou and Stephens, 2012) to our knowledge. 



sparsity has not been applied to estimating a G-matrix. 



2 Methods 

We will specify a Bayesian factor model that encodes the two main biological assump- 
tions we make on the G-matrix: sparsity in the number of factors comprising the 
matrix, and sparsity in each factor - each factor is comprised of a few components. 
This factor model is designed to address the high-dimensional setting where hun- 
dreds or thousands of traits are simultaneously examined. The sparsity assumption 
is the key feature in our model that allows us to scale stable and accurate inference 
to a very large number of traits. For high-dimensional models sparsity helps prevent 
sampling errors from swamping out the true signal in data leading to stable parame- 
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ter estimates. In our model, sparsity implies that each underlying trait will effect few 
of the observable phenotypes and as a result many of the parameters in the model 
will be (near) zero. 



2.1 Model: 

We derive the Bayesian sparse factor model as an extension to the classic multivariate 
animal model. For a single trait the following linear mixed effects model is used to 



explain phenotypic variation (Henderson, 1984): 



Ji = Xhi + ZUi + Bi (1) 

where is the vector of phenotype measurements for trait i on n individuals, b is a 
vector of fixed effects and environmental covariates such as sex or age, with design 
matrix X, Uj is the random vector of additive genetic effects with covariance afA, 
where A is the known additive relationship matrix among the individuals and Z 
relates the random effects to the observations, and are error which are assumed 
to be independent of the additive genetic effects. 

In going from one trait to p traits we can stack the vector for each trait in ([T| 
to a n X p matrix Y specified by the following linear mixed effects model: 

Y = XB + ZU + E (2) 

where the random terms U = [ui . . . Up] and E = [ei . . . Bp] are drawn from matrix 
normal distributions 

U~MN,,p(0;A,G), E ~ MN„,p(0; I„, R), (3) 

where is the nxp matrix of zeros, G and R are the pxp matrices modeling genetic 
and residual covariances among traits, and A is the known additive relationship 
matrix among the individuals, with rank r < n. The matrix normal distribution is 
defined as 

„rv I M O 5]^ exp(-itr[17-V^(V-MfS-^(V-M)]) 

We model the genetic and residual covariance matrices with a factor structure 

G = A«S„A^ + *«, 

T (4) 
R = AeSeA^ + 
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where A„ and Ag are pxk^ and pxke factor loading matrices, S„ and Sg 

and fee X ke diagonal matrices, and and are p x p diagonal matrices. We can 

now specify U and E via the following hierarchical model 

U = F„A^ + A, E = F,AJ + H 
F„~MN,,,„(0;A,SJ, ~ MN„,fc^(0; I„, S^) (5) 
A ~ MN,,p(0; A, * J, H ~ MN„,p(0; I„, 

In the above model F„ and Fe are latent traits corresponding to additive genetic 
effects and residuals, respectively, and S„ and Se model the covariances of these 
latent effects. 

In (|5]), we assume that the underlying genetic and residual effects are unique and 
fit the factors A^^ and Ag separately. This need not be the case since some factors 
may influence both U and E. 

To generalize to factors driving both genetic and residual effects we rewrite ^ 

as 

Y = XB + FA^ + ZA + H (6) 

where F = ZF* + F* is a single set of latent traits with F* = [F„ Or,ke] and F* = 
[On,ku Fe], and A = [Au Ag] with k = ku + kg. We can specify F via the following 
matrix normal distributions 



MN.JO,A 



S„ 






Eg 



(7) 



where S„ = Diag(cr^^,) and Eg = Diag(crg^,). 

In addition to inference of the latent traits F themselves, we would like to infer 
the heritability of each latent trait. By ([T]) the columns of F are independent, and 
by marginalizing over F^^ and Fg we obtain the distribution of each factor as 



f,~N„(0,<,ZAZ^ + agy„). 



Heritability of the latent trait f^, is the ratio of its additive genetic variance to its 
total (phenotypic) variance: h"^ = ^2 ■ Reparameterizing the above distribution 
leads to 



K + - N.(0' ^'ZAZ^ + (1 - h])W. (9) 
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This makes heritability explicit as an inference parameter. The relevance of this 
is that the latent traits themselves are likely to have both genetic and non-genetic 
influences, like any other trait. Without loss of generality we can scale Aj by the 
phenotypic standard deviation of factor j, y/a^~-|-a^, and renormalize a^^. +(^1- = 1- 
In this formulation, the key matrices G and R can be recovered as: 

G = Adiag(^2)A^ + *„ 
R = A diag(l - h^) A^ + 

2.2 Prior specification: 

The model specified in (|6]) is identical to the standard multivariate mixed model (|2]). 



It is clear in modeling high- dimensional data (Meyer and Kirkpatrick, 2010) 
that that stability and accuracy of parameter estimates is a serious problem and 
some prior specification or penalty/regularization is required for robust estimates - 
that is constraints on G and R are required. We impose constraints on G and R 
through highly informative priors on A. Our priors are based on two key biological 
assumptions based on the idea that the genetic and residual covariances arise from 
variation in underlying developmental processes which are driven by gene networks 
or metabolic pathways. This implies: 

(1) The biological system has limited complexity - a limited number of pathways 
are relevant for trait variation, k <^ p. For the model this means that the number of 
factors is low. 

(2) Each underlying developmental pathway affects a limited number of traits. For 
the model this means the factor loadings are sparse. 



We formalize the above assumptions by priors on A that impose sparsity ( Bhat- 



TACHARYA and DuNSON, 2011). Sparsity in A will impose constraints on the genetic 
and residual covariance matrices. Sparsity on the loadings, assumption (2), is im- 
posed by a heavy tailed distribution on elements Xim of the factor loadings which 
favors small values while allowing a few large values. Limits on the number of fac- 
tors, assumption (1), is imposed by a prior that shrinks the magnitude of successive 
factors. The prior is specified as a hierarchical distribution on each element Xim of 
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A: 



Aim I ~ N (O, 0j^r, 



m 



-1 



) 



m 



T, 




(11) 



^1 ~ Ga(ai, 

5; ~ Ga(a2, 62) for / = 2, /c. 



Shrinkage on the number of factors is imposed by the parameter which increasingly 
shrinks all elements of high-index columns of A. This shrinkage is induced by the 
the stochastically increasing product of the sequence {61}. Sparsity in the factor 
loadings is controlled by (pim which governs the precisions of Afm- Conditional on 
Tm, marginalizing out the (pim leads to a t-distribution for each Ajm with z/ degrees 
of freedom - this is the heavy tail that imposes sparsity on the factor loadings. 

For heritability of each latent trait we set as a prior a discrete set of values in the 
unit interval. This was done for computational efficiency. 



In principle, we could place a prior on the interval [0, 1], but such a prior would not 
be conjugate, and so coding a MCMC sampler slightly more difficult. 

We place gamma priors on each of the inverse variances on the diagonals of 
and \l/e. Priors on each element of B are normal distributions with very large (> 10® 
) variances. 

2.3 Implementation: 

Inference in the above model uses an adaptive Gibbs sampler for which we provide 
detailed steps in the appendix. The code has been implemented in Matlab® and can 
be found at the website (http:/ /stat. duke. edu/^sayan/quantmod. html) . 



7r{hf = l/uh) = l/rih, where / = . . . (uh — 1) 



(12) 
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3 Results 



3.1 Simulation example: 

To test the performance of our model, we generated 50 simulated datasets from a 
pedigree using the R ( |R Core Team[[2012| ) package pedantics ( |Mqrrissey| |2010[ ) . 
The gryphons pedigree in this package is designed to be relevant to power and sensi- 



tivity analyses for quantitative genetic studies of natural populations (MORRISSEY 



2010). We selected 148 individuals from last cohort of the pedigree and their 39 



mothers as our sample population. All individuals had at least one relative in the 
sample so that the pedigree was moderately informative. We simulated phenotypic 
data using the function phensim. For each simulation, we constructed sparse, mod- 
ular matrices G and R for p = 100 traits. We began by generating a set of /c = 8 
sparse factors with non-zero loadings for 10-50 traits. We then assigned positive 
heritabilities to = 5 of these factors, each drawn from an independent Beta(3, 2) 
distribution. We set = 0.2 x Ip and *e = 0.5 x Ip, and calculated G and R using 
equation (10). In these simulations, the narrow-sense heritabilities (/i^) of the 100 
measured traits ranged from ~ 0.02 — 0.8, with the majority < 0.2. We calculated 
the genetic relationship matrix A as 2-times the matrix of kinship coefficients cal- 



culated from the full gryphons pedigree using the kinship package (Atkinson and 



Therneau 2012). 



We compared our Bayesian sparse factor model to a standard pairwise mixed 
model analysis. We set the prior hyperparameters: u = 3,ai = 2,bi = 1/20, 02 = 
3, 62 = 1- For each simulation, we ran our Gibbs sampler for 12,000 iterations, 
discarded the first 2,000 samples as a burn-in period, and collected 1000 posterior 



samples with a thinning rate of 10. We then used the program WOMBAT (Meyer 



2007) to fit each of the p{p + l)/2 genetic covariances in the G-matrix separately 



using REML. Our Bayesian sparse factor model provided more accurate estimates 
of these genetic covariances (Figure [T]). On average, the mean square error (MSE) 
of the covariance estimates, a measure of absolute accuracy, was 33% lower with 
the factor model than with the pairwise analysis (Figure [2]A.). The improvement 
approached 80% in cases when the data were more informative, but the quality 
of the fits converged in simulations in which neither model estimated the genetic 
covariances particularly well. The correlation between estimated and actual genetic 
covariances, a measure of the accuracy of relative magnitudes of covariances among 
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genes, was consistently higher with the factor model, with an average improvement of 
> 40% (Figure |2]B). The factor model's covariance estimates were positively biased, 
while the pairwise REML estimates were not (Figure [2p). h"^ estimates from the 
factor model had similar MSE to the REML estimates, but were also consistently 
positively biased. 




-1 1 

correlation 



Figure 1: The Bayesian genetic sparse factor model accurately estimates 
the G-matrix from a pedigree. A. Matrix of simulated genetic correlations. 
A G-matrix was simulated for 100 traits with five factors (see Methods). The G- 
matrix was normalized to have unit genetic variances on the diagonal for visual 
clarity. Phenotypic data from 187 individuals simulated from the gryphons pedigree 
(MoRRlSSEY, 2010) was generated given this G-matrix. B. Matrix of posterior 
mean genetic correlations estimated from these simulated individuals by the Bayesian 
genetic sparse factor model for 100 traits. C. Matrix of genetic correlations estimated 



by REML from two-trait animal model analyses run using WOMBAT (Meyer 



2007). All panels share the same color scale. 



Importantly, the Bayesian sparse factor model produced estimates of the G- 
matrix that were positive definite and accurately fit the underlying latent trait struc- 
ture. In our simulations with 100 traits, element-wise estimates of the G-matrix by 
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Figure 2: Joint estimation across all traits leads to quantitative improve- 
ments in parameter estimates. Pairwise genetic correlations among 100 traits 
from simulations of 187 individuals were estimated by two methods: 1) The Bayesian 
sparse factor model for 100 traits; 2) Repeated (pairwise) REML using WOMBAT. 
Panels A-C compare the accuracy and bias of the sparse factor model estimates 
to the pairwise REML estimates across 50 simulations. Each point represents the 
results of one simulation. A. Mean squared error of genetic covariances. B. Squared 
Pearson correlations of estimated and actual genetic covariances. C. Average ab- 
solute bias in genetic covariances. D. Accuracy of latent trait heritabilities. For 
each simulation, fitted latent traits were matched to simulated factors by calculating 
the (absolute) correlation between the C(ll2imns of A and the true trait loadings on 
each factor. In this plot, known heritabilities of each of the eight simulated factors 
(x-axis) are compared to the heritability estimates (y-axis) of the most correlated 
latent traits in that simulation. All 50 simulations are combined. The diagonal line 
in each plot shows y = x. 



REML had 40-48 negative eigenvalues, and thus could not easily be inverted in cal- 



culations of genetic selection gradients (Rausher 1992). In all 50 simulations, the 
columns of A included each of the 8 simulated latent traits. And, because of the 
sparsity prior, no a posteriori rotation of the factors was needed for interpretability 
(e.g., Meyer 2009, Figure |3]). Furthermore, the estimates of the heritabilities of 
latent traits were accurate (r = 0.87 over all 50 simulations, after matching up each 
simulated latent traits with the most correlated column of A, Figure p| 



3.2 Gene expression example: 

We downloaded gene expression profiles of 40 wild-derived lines of Drosophila melanogaster 



from ArrayExpress (accession: E-MEXP-1594, Ayroles et al. 2009) and used our 
Bayesian sparse factor model to infer an among-line gene expression covariance 
matrix for a subset of the genes. We first normalized the processed gene expres- 
sion data to correspond to the the analyses of the earlier paper and then selected 



the 414 genes that Ayroles et al. (2009) identified as having a plausible among- 
line covariance with competitive fitness. We also downloaded competitive fitness 
data for each line from the Drosophila Genetic Reference Panel (DGRP) website 



(http://dgrp.gnets.ncsu.edu/) so that we could estimate the among line covariance 
between each gene and this phenotypic trait. 

In this dataset, two biological replicates of male and female fly collections from 
each line were analyzed for whole-animal RNA expression. The competitive fitness 
measurements were means of 20 competitive trials done with sets of flies from these 
same lines, but not the same flies used in the gene expression analysis. Therefore, 
gene expression values for the samples measured for competitive fitness and com- 
petitive fitness values for the samples measured for gene expression were treated as 
missing data (see Appendix). We used our model to estimate the G-matrix of the 
genes (in this case, the covariance of line effects). Following the analyses of Ayroles 



et al. (2009), we included a fixed effect of sex, and independent random effects of the 



sex:line interaction for each gene. No sex or sex:line effects were fit for competitive 
fitness itself as this value is measured at the level of the line, not individual files. 

We set the prior hyperparameters as above, and ran our Gibbs sampler for 40,000 
iterations, discarded the first 20,000 samples as a burn-in period, and collected 1,000 
posterior samples of all parameters with a thinning rate of 20. Our estimate of the 
G-matrix was qualitatively similar to the original estimate (Figure IlK, and compare 
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Figure 3: The Bayesian genetic sparse factor model recovers the underly- 
ing latent factors in simulated data. These panels represent the same simulated 
dataset as in Figure [Tj A. Matrix of phenotypic correlations (P = G + R, normalized 
to have unit variances). B. Simulated latent traits. Each column represents the load- 
ings of the 100 measured traits on one of the eight simulated latent traits. Traits 1-5 
were assigned non-zero heritablities. C. Posterior mean estimates of the latent traits 
under the Bayesian genetic sparse factor model. The model selected 13 factors, but 
only eight had large loadings for any of the measured traits. D. Pearson correlations 
(r) between the loadings of the eight simulated traits, and posterior means of the 
13 factors. All eight simulated latent traits were recovered in the estimated factors 
with |r| > 0.96. Panels A and D share the same color scale, as do panels B and C. 
The latter color scale is truncated for clarity. 
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to Figure 7a in Ayroles et al. (2009)). However, the estimate from our factor model 
was positive definite, while the original estimate was not since it was calculated in a 
pairwise fashion. Nevertheless, estimates of the broad-sense heritability of each gene 
were similar (r = 0.74). 



Using the Modulated Modularity Clustering (MMC) algorithm (Stone and Ay- 



ROLES 



2009), Ayroles et al. (2009) identified 20 modules of genetically correlated 



transcripts post-hoc. In our factor model, modules are estimated simultaneously with 
the G-matrix itself. Each factor (column of A) represents a sparse set of genes that 
are highly correlated in their expression, possibly due to common regulation by some 
latent developmental trait. Our model identified 27 such latent factors (Figure |4^). 
Of these factors, 13-16 of them were consistently identified (r > 0.95) across 3 paral- 
lel chains of the Gibbs sampler, and most of the rest were minor, each accounting for 
less that 1% of the variance in any gene. Many factors were similar, but not identical 
to the modules identified by MMC (Figure |4^). Some of the factors were nearly 
one-to-one matches to modules (e.g., factor 10 with module 8, and factor 14 with 
module 12). However, others merged together two or more modules (e.g., factor 1 
with modules 7 and 9, and factor 2 with modules 4, 13, 16-20). And some entire 
modules were part of two or more factors (e.g., module 17 was included in factors 2 
and 4, and module 18 was included in factors 2 and 16). 

One reason for the discrepancy between our factor model and the MMC results 
is that our model allows each gene to belong to more than one of the latent traits. A 
second difference is that our model infers factors at the level of phenotypic variation, 
rather than the among-line covariances. The broad-sense heritability [H'^) of the 
latent traits (factors) ranged from 0.03 to 0.90 (Figure |4j3). The majority of the 
factors, had intermediate H^, between 0.1 and 0.65, but 5 were largely genetic with 

> 0.75. 

Although a functional analysis would be more powerful if more genes were studied 
simultaneously, the latent traits defined by the modules do capture intriguing bio- 
logical relations: using the Database for Annotation, Visualization and Integrated 
Discovery (DAVID) v6.7 (Huang et al, 2009a[b ), several of the factors were indi- 
vidually enriched (within this set of 414 genes) for proteins related to processes such 
as: defense and immunity, nervous system function, odorant binding, transcription 
and cuticle formation. Similar molecular functions were identified among the mod- 
ules identified by Ayroles et al. (2009). These authors highlighted modules 7-9 in 
particular - modules 7 and 9 contained largely female-biased genes, while module 8 
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A 



Genetic correlations 



Gene loadings on latent traits 




genes factors 



Figure 4: Among-line covariance of gene expression and competitive fit- 
ness in Drosophila is modular. Gene expression (414 genes) and competi- 
tive fitness data for 40 wild-derived lines of Drosophila melanogaster were down- 
loaded from ArrayExpress (accession: E-MEXP-1594) dAYROLES eTal] [2009| and 
[http:/ /dgrp.gnets.n csu.eduTl A. Genetic (among-line) architecture of gene expres- 
sion traits. The three panels show: i) Posterior mean broad-sense heritabilities {H"^) 
for the 414 genes, ii) Posterior mean genetic correlations among these genes, and iii) 
Posterior means and 95% highest posterior density (HPD) intervals around estimates 
of genetic correlations between each gene and competitive fitness. For comparison, 
see Figure 7a of ( Ayroles et al. , 2009[ ). B. Latent trait structure of gene expression 
covariances. The three panels show: i) Posterior mean for each estimated latent 
trait, ii) Posterior mean gene loadings on each latent trait, and iii) Posterior means 
and 95% (HPD) intervals around estimates of genetic correlations between each la- 
tent trait and competitive fitness. The right-axis of panel B. groups genes into 



modules inferred using Modulated Modiiferity Clustering (Stone and Ayroles 



2009; Ayroles et al. 2009). 



contained male-biased genes - and hypothesized that negative genetic correlations 
between genes in the female- and male-biased modules could act to maintain genetic 
variation in fitness. In our model, modules 7 and 9 are largely grouped into factor 1, 
which also weakly includes the genes of module 8. However, factor 10 is the major 
contributor to genetic variation in the genes of module 8. 

Finally, by adding a competitive fitness as a 415th trait in the analysis, we could 
estimate the among-line correlation between the expression of each gene and this 
fitness-related trait (Figure |4|\). Many (60/414 ~ 15%) of the 95% highest poste- 
rior density (HPD) intervals on the among-line correlations did not included zero, 
although most of these correlations were low (for 85% of genes, \r\ < 0.25) with a few 
as large as |r| ~ 0.45. We also estimated the genetic correlation between competitive 
fitness and each of the latent traits defined by the 27 factors (Figure |4|3). Here, most 
correlations were nearly zero. However, the genetic correlations between competitive 
fitness and factors 2 and 16 were large and highly significant, suggesting potentially 
interesting genetic relationships between these underlying traits and fitness. 

4 Discussion 

The Bayesian sparse factor model performs well on both simulated and real data, 
and thus opens the possibility of incorporating diverse and highly complex traits 
into evolutionary genetic studies and breeding programs. Gene expression traits in 
particular provide a way to measure under-appreciated molecular and developmen- 
tal traits that may be important for evolution, and technologies exist to measure 
these traits on very large scales. Our model can also be applied to other molecular 
traits (e.g., metabolites or protein concentrations), high dimensional morphological 
traits (e.g., outlines of surfaces from geometric morphometries), or gene-environment 
interactions (e.g., the same trait measured in multiple environments). 

4.1 Scalability of the method: 

The key advantage of the Bayesian sparse factor model over existing methods is 
its ability to provide robust estimates of covariance parameters for datasets with 
large numbers of traits. In this study, we demonstrated very high performance 
of the model for 100 simulated traits, and robust results on real data with 415. 
Similar factor models (without the genetic component) have been applied to gene 
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expression datasets with thousands of traits (IBhattacharya and DunsonI 120111) 



and we expect the genetic model to perform similarly. The main limitation will be 
computational time, which scales roughly linearly with the number of traits analyzed 
(assuming the number of important factors grows more slowly). Parallel computing 
techniques may speed up analyses in cases of very large numbers of traits. 

The main reason that our model scales well in this way is that under our prior, 
each factor is sparse. Experience with factor models in fields such as gene expression 



analysis, economics, finance, and social sciences (Fan et ai, 2011), as well as with 



genetic association studies (e.g., Engelhardt and Stephens 2010 Stegle et al. 



2010; Parts et al. 2011) demonstrates that sparsity (or shrinkage) is necessary to 
perform robust inference on high-dimensional data (BiCKEL and Levina 2008bpi 



EL Karoui 2008 Meyer and Kirkpatrick 2010). Otherwise, samphng variabil- 



ity can overwhelm any true signals, leading to unstable estimates. Here, we used the 



i-distribution as a shrinkage prior, following ( Bhattacharya and DuNSON, 2011) 



but many other choices are possible (Armagan et al. . 2011) 



4.2 Applications to evolutionary quantitive genetics: 

The G-matrix features prominently in the theory of evolutionary quantitative ge- 
netics, and its estimation has been a prominent goal of many experimental and 



observational studies (Walsh and Blows, 2009). Since our model is built on the 



standard mixed effect model framework, it is fiexible and can be applied to many 
experimental designs or studies. And since our model is Bayesian and naturally 
produces estimates within the parameter space, posterior samples from the Gibbs 
sampler provide convenient credible intervals for the G-matrix itself and many evo- 
lutionarily important parameters, such as trait-specific heritabilities or individual 



breeding values (Sorensen and GlANOLA, 2010). 



An important use of G-matrices is to predict the response of a set of traits to 



selection (Lande, 1979). Applying Robertson's 2nd theorem of natural selection. 



the response in y will equal the additive genetic covariance between the vector of 
traits and fitness (Ay = cr^(y,M;)) ([Rausher] |1992[ |Walsh and Blows"} |2009[). 



This quantity can be estimated directly from our model if fitness is included as the 
P* = {p+ l)th trait: 



Ay 



,A„ , 



where A, 



7p* 



contains all rows of A^ except the row for fitness, and A„^, contains 
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only the row of corresponding to fitness. Similarly, the quantity 1 — \l'u^, /Gp* ^p* 
equals the percentage of genetic variation in fitness accounted for by variation in the 
measured traits (Walsh and Blows, 2009), which is useful for identifying other 
traits that might be relevant for fitness. 

On the other hand, our model is not well suited to estimating the dimensionality 
of the G-matrix. A low-rank G-matrix means that there are absolute genetic con- 
straints on evolution (Lande, 1979). Several methods provide statistical tests for the 



rank of the G-matrix (e.g., |HlNE and BLOWs|2006t|KlRKPATRlCK and Meyer|2004 



Mezey and HoULE 2005). We use a prior that shrinks the magnitudes of higher 



index factors to provide robust estimates of the largest factors. This will likely have 
a side-effect of underestimating the total number of factors. However, absolute con- 
straints appear rare (HoULE, 2010), and the dimensions of the G-matrix with the 
most variation are likely those with the greatest effect on evolution in natural pop- 
ulations ( |Schlute"r| |1996[ |Kirkpatrick| |2009[ ). Our model should estimate these 
dimensions well. From a practical standpoint, pre-selecting the number of factors 
has plagued other reduced-rank estimators of the G-matrix (e.g., Kirkpatrick and 



Meyer 2004; Hine and Blows 2006; Meyer 2009). Our prior is based on an infi- 



nite factor model (Bhattacharya and DuNSON, 2011), and so no a priori decision 
is needed. Instead, the parameters of the prior distribution become important mod- 
eling decisions. In our experience, a relatively diffuse prior on 6i with a2 = 3 tends 
to work well. 



4.3 Biological interpretation of factors: 



Genetic modules are sets of traits likely to evolve together. By assuming that the 
developmental process is modular, we can model each latent trait as affecting a 
limited number of phenotypic traits. Other techniques for identifying genetic modules 



include the MMC algorithm (Stone and Ayroles, 2009 Ayroles et al. 2009), and 



spectral decomposition which treats each major eigenvector as an estimate of such 
an underlying module (e.g., McGRAW et a/.||2011 ). The former technique constraints 
each trait to belong to only one module, while the biological interpretation of the 
latter is unclear because of the mathematical constraint that the eigenvectors be 



orthogonal (HANSEN and HoULE, 2008). In classic factor models (such as proposed 



by Meyer (2009), or de Los Campos and Gianola (2007)), the factors are not 



identifiable (Meyer 2009), and so the identity of the underlying modules is unclear. 
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Under our sparsity prior, the factors are identifiable up to a sign-flip (each factor can 
be multiplied by —1 without affecting its probability under the model). However, in 
simulations and with the Drosophila gene expression data, our Gibbs sampler (see 
the Appendix) chooses a single sign of each factor for long stretches of each chain. 
Also, independent chains identify the same dominant factors. The ordering of the 
factors is also constrained in our model. The prior on 6h makes factors with large 
loadings on large numbers of traits increasingly improbable for higher-index factors. 
Under vague priors on Sh, the order of similarly indexed factors can be different 
among MCMC chains. In general, the order of the factors is not of great biological 
interest. However, as more high-dimensional datasets are created and studied, more 
informative priors on 6h may be justified and will likely reduce this problem. 

A unique feature of our model is the fact that we estimate genetic and envi- 
ronmental factors jointly, instead of separately as in classic multilevel factor models 
(e.g., Gqldstein| 2010 ) . If each factor represents a true latent trait (e.g., variation 
in a developmental process), it is reasonable to decompose variation in this trait 
into genetic and environmental components. We directly estimate the heritability 
of the traits underlying each factor, and therefore can use our model to predict the 
evolution of these latent traits. 



4.4 Extensions: 



Our model is built on the classic mixed effect model common in quantitative genetics 



(Henderson, 1984). It is therefore straightforward to extend to models with addi- 



tional fixed or random effects (e.g., dominance or epistatic effects) for each trait. The 
update equation for hj in the Gibbs sampler described in the Appendix however does 
not allow additional random effects in the model for the latent factors themselves (f,- 



in equation ([8j)), although other formulations are possible. A second extension relates 
to the case when the relationship matrix among individuals (A) is unknown. Here, 
relationship estimates from genotype data can be easily incorporated. As such, our 
model is related to a recently proposed sparse factor model for genetic associations 



with intermediate phenotypes (Parts et ai, 2011). These authors introduced prior 



information on genetic modules from gene function and pathway databases which 
could be incorporated in our model in a similar way. 
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5 Conclusions 



The Bayesian sparse factor model for genetic analysis that we propose provides a 
novel approach to genetic estimation with high- dimensional traits. We anticipate 
that incorporating many diverse phenotypes into genetic studies will provide pow- 
erful insights into evolutionary processes. The use of highly-informative but biolog- 
ically grounded priors is necessary for making inferences on high-dimensional data, 
and can help identify developmental mechanisms underlying phenotypic variation in 
populations. 



6 Appendix 

6.1 Posterior sampling: 

We estimate the posterior distribution of the Bayesian genetic sparse factor model 



with an adaptive Gibbs sampler based on the procedure proposed by B hattach ARYA 



and DuNSON (2011). The value k* at which columns in A are truncated is set using 



an adaptive procedure ( Bhattacharya and DuNSON 2011). Given a truncation 



point, the sampler iterates through the following steps: 

1. If missing observations are present, values are drawn independently from uni- 
variate normal distributions parameterized by the current values of all other 
parameters: 

7c{y,j I -) ~ N (x(^')bi + f (^^A, + z(^)5„ 

where i/ij is the imputed phenotype value for the i-th trait in individual j. The 
three components of the mean are: x'^-'^ the row vector of fixed effect covariates 
for individual j times bj, the ith column of the fixed effect coefficient matrix; 
f*^-'^ the row vector of factor scores on the k* factors for individual j times Aj, 
the row of the factor loading matrix for trait i; and z^^\ the row vector of the 
random (genetic) effect incidence matrix for individual j times 6i, the vector of 
residual genetic effects for trait i not accounted for by the k* factors. Finally, 
cjj"^ is the residual precision of trait i. All missing data can be drawn in a 
single block update. 
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2. The fixed efi'ect coefficient matrix B, the truncated factor loading matrix A^* 
and the residual genetic effects matrix A can be stacked into a single matrix, 
and then its columns factor into independent multivariate normal conditional 
posteriors: 



TT 



K 



where W and C are defined as: 



W = [X F Z] 




"o Diag(0,7T/) 



3. The conditional posterior of the factor scores F is a matrix variate normal 
distribution: 

TT (F I -) ~ MN,,,. (c-i (Y*;iAk* + ZF„Diag(l - h^,)-^^ , C-^) 
where C is: 

C = A^.*;^A,*+Diag(l-/i2)-i 

and Y is: 

Y = Y - XB - ZA. 

4. The conditional posterior of the genetic effects on the factors, F„ factors into 
independent multivariate normals for each factor f«^, m — 1 . . . k* st ^ 0: 



MN {C-\l-hl)-'ZF^,C-') 



where C is: 
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5. The conditional posterior for each of the latent factor heritabihties h^,m — 
1 . . .k* is calculated by integrating out F„ and summing over all possibilities 
of /i^, since the prior on this parameter is discrete: 

N (F^ I 0, /i^ZAZ^ + (1 - h'')ln) n{hl = h^) 



E N (F^ I 0, /i^ZAZ^ + (1 - 7r(/ii = hj\ 

1=1 



where N(x | /x, S) is the multivariate normal density with mean /i and variance 
E, evaluated at x, hf — l/hh, and in general, 7r(/i^ = h\) — l/uh- Given this 
conditional posterior, is sampled from a multinomial distribution. 

6. The conditional posterior of the trait-factor loading variance (pih for trait i on 
factor h is: 



7r(0j/i I -) Ga 



2 ' 2 

7. The conditional posterior of 5^, m = 1 ... A;* is as follows. For 6i: 



tt{6i I — ) ~ Ga I tti 
and for Sh, h > 2: 

T^i^h I -) ~ Ga 



1=1 3=1 / 



(k* p \ 

l=h 3=1 J 



where r/'^'' = fj 5t ioi: h = 1 . . .k* . 

t=l,t^h 

8. The conditional posteriors for the precision of the residual genetic effects of 
trait i, ipuu, is: 



9. The conditional posteriors for the model residuals of trait i, ^, is: 



TT 



[af I -) ^ Ga [ar + + ^ E iy^o - x^^'^b, - f (^^A, - z(^)5,)' j 
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Other random effects, such as the hne x sex effects modeled in the gene expression 
example of this paper can be incorporated into this sampling scheme in much the 
same way as the residual genetic effects, A, are included here. 
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